Abstract

Text which either appears in a scene or is graphically added to video can provide an important
supplemental source of index information as well as clues for decoding the video's structure
and for classification. In this paper we present algorithms for detecting and tracking text
components that appear within digital video frames. Our system implements a scale-space
feature extractor that feeds an artificial neural processor to extract textual regions and track
their movement over time. The extracted regions can then be used as input to an appropriate
Optical Character Recognition system which produces indexable keywords.

Keywords:  Text  Detection,  Text  Tracking,  Video  Indexing,  Digital  Libraries,  Neural  Network, 
Wavelet 
1  Introduction


The increasing availability of online digital imagery and video has rekindled interest in the
problems of how to index multimedia information sources automatically and how to browse
and manipulate them efficiently. Traditionally, images and video sequences have been manually
annotated with a small number of keyword descriptors after visual inspection by a human
reviewer. Unfortunately, this process can be very time-consuming, and such delays, although
perhaps acceptable for archival applications, may inhibit the ability to perform near-real-time
filtering and retrieval.

For  some  media,  the  problem  of  indexing  is  better  understood  and  has  been  more  thoroughly 
addressed.  For  example,  there  has  been  tremendous  success  in  the  automatic  conversion  of 
hard-copy  documents  via  optical  character  recognition  (OCR)  technology  [20,  22,  35,  41]  and 
the  transcription  of  speech  via  voice  recognition  (VR)  technology  [24,  43].  In  both  cases,  the 
output,  although  typically  less  than  perfect,  is  an  ASCII  text  representation  which  can  be 
indexed  with  traditional  information  retrieval  techniques.  Unfortunately,  when  dealing  with 
visual information, we do not always have a linguistic representation of content. We do find,
however,  that  some  information-rich  video  sources  such  as  newscasts,  commercials,  movies  and 
sporting  events  contain  meaningful  content  in  the  form  of  voice,  closed-caption  text  and/or 
text  in  the  image.  These  sources  often  complement  each  other,  and  the  ability  to  use  them  can 
provide  essential  supplemental  information  useful  for  indexing  and  retrieval. 

In  this  paper,  we  address  only  the  problem  of  detecting  and  tracking  text  in  digital  video. 
Although  sound  and  closed  captions  provide  index  information  on  the  spoken  content,  basic 
annotational  information  often  appears  only  in  the  image  text.  For  example,  sports  scores, 
product names, scene locations, speaker names, movie credits, program introductions and special
announcements often appear and supplement or summarize the visual content. These types
of annotations are usually rendered in high contrast with respect to the background, are of readable
quality, and use keywords that facilitate indexing. Specific searches for a particular actor
or reference to a particular story can easily be realized if there exists access to this textual
content. 

New standards in video coding are extending the scope of video to streaming multimedia.
The efforts of the Moving Picture Experts Group (MPEG), a joint working group of the International
Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC),
are producing standards which are object-based. Within these standards video can be encoded as a static
background with moving foreground composed of various objects, ultimately allowing annotation
of textual content as an object and making it easier to extract and use as an indexing tool.
Unfortunately, the standards are not yet widely implemented and are not expected to be accepted
as mainstream for some years. Nevertheless, the extraction of textual content rendered
as part of the video and text embedded in the scene is extremely valuable for indexing.

2  Background 

Before we embark on the general subject of text extraction it is useful to consider some specific
aspects of the problem. We first discuss synthetic and natural sources of textual content within
video frames and then review some related work and present our approach.

2.1  Scene  Text  and  Graphic  Text 

At  a  high  level,  text  can  be  divided  into  two  classes,  scene  text  and  graphic  text.  Scene  text 
appears  within  the  scene  which  is  then  captured  by  the  recording  device.  It  is  an  integral  part 
of  the  image  and  can  be  considered  a  sample  of  the  real  world.  Examples  of  scene  text  include 
street  signs,  billboards,  text  on  trucks  and  writing  on  shirts.  Although  valuable,  the  appearance 
of  such  text  is  typically  incidental  to  the  scene  content,  and  only  useful  in  applications  such 
as  navigation,  surveillance  or  reading  text  appearing  on  known  objects,  rather  than  general 
indexing  and  retrieval.  One  exception  is  in  constrained  domains,  where  text  or  symbols  may 
be used to identify players or vehicles. Scene text is often difficult to detect and extract since
it  may  appear  in  a  virtually  unlimited  number  of  poses,  sizes,  shapes  and  colors. 

Graphic  text,  on  the  other  hand,  is  text  that  is  mechanically  added  to  video  frames  to 
supplement  the  visual  and  audio  content,  and  is  often  more  structured  and  closely  related  to 
the  subject  than  scene  text  is.  Examples  of  graphic  text  include  headlines,  keyword  summaries, 
time  and  location  stamps,  names  of  people  and  sports  scores.  The  descriptors  are  typically 
predictable,  have  simple  styles,  and  are  produced  with  the  intent  of  being  read  by  the  viewer. 

Graphic  text  has  a  number  of  functions  which  differ  between  domains.  In  commercials,  text 
appears  to  reinforce  the  vital  information  such  as  the  product  name,  claims  made,  or  in  some 
cases, to provide disclaimers. In sporting events, text is used to identify specific players, provide
game information, or relay statistics. In newscasts, graphic text can be used to either identify
key features of the scene, such as location (White House lawn) or speaker (Bill and Hillary),
to provide a synopsis of the topic (Blizzard 97), or to provide a visual summary of statistical
information. In movies and television shows, text provides production and acting credits, and
in other cases captions or language translations. For the most part, research in this area has
focused on the identification of graphic text.

2.2  Related  Work 

In related domains there has been work on the extraction of text from road signs [32, 44], license
plates [10, 29, 31], library books [19], WWW images [59, 60], scene images [30, 42, 57, 58] and
isolated video frames [33, 36, 53]. The methods can be broadly classified into two types. The
first is connected component (CC) based. Unlike binary document images, scene images and
video frames are usually multivalued (gray-scale or colored), and therefore multivalued image
decomposition is required before the connected components can be extracted. The general idea
of the CC-based methods can be described as follows. A multivalued image $I$ with pixel values
in $\{0, 1, \ldots, U-1\}$, where $U$ is an integer greater than 1, can be decomposed into a set of
elementary images $P = \{I_i\}$ with each $I_i$ having the same pixel value. Under the assumption
that text is represented with a uniform color (or intensity), connected components are extracted
for each elementary image $I_i$. Some heuristic restrictions on component size, number of aligned
components and line orientation are used to identify text lines. If $I$ is a color image, color
clustering techniques are required [26, 28, 59, 60], whereas for gray-scale images, a binarization
process is typically used [30, 42]. Depending on the complexity of the color clustering or
binarization method used, CC-based methods can locate text quickly, but have difficulties
when text is embedded in complex backgrounds or touches other text or graphical objects.

The second approach is texture-based and uses well-known texture analysis methods such as
Gabor filtering [25], Gaussian filtering [56] or spatial variance [58] to locate text regions. In [25]
Jain describes a method of separating text and image areas based on a group of multichannel
Gabor filters. Wu and Manmatha present a text extraction system based on Gaussian filtering
[56] and treating text as a distinctive texture. After filtering, each pixel in the original image
will be represented by a feature vector that consists of the energy calculated from the filtered
images. Clustering methods, such as k-means, are then used to cluster these feature vectors.

For text detection in digital video, some authors use interframe analysis to add missing
characters or to delete incorrectly identified regions. Shim detects text in five consecutive
frames and then examines the similarity of text regions in terms of their positions, intensities
and shapes [53]. In [36] Lienhart extracts text in one frame and then checks if the region
corresponds to text in the following frame by using simple region matching. Five frames are
used to filter out incorrectly identified text regions. Neither work addresses the problem of
tracking text to find temporal correspondence of detected text in digital video. For example,
in Lienhart's system, all the text in video frames is saved in a database after recognition.

The  literature  on  object  tracking  is  extensive  and  includes  face  tracking  [4,  14,  21],  human 
body tracking [18, 49, 55], vehicle tracking [16], medical imaging [2, 52] and agricultural
automation [23, 47]. A detailed survey of tracking algorithms for non-rigid motion can be found
in  [1],  but  there  appears  to  be  very  little  appropriate  literature  on  the  tracking  of  text. 

2.3  Our  Approach 

Text  detection  and  tracking  in  digital  video  can  be  viewed  as  a  multi-target  detection  and 
tracking  problem  since  multiple  text  blocks  can  appear  at  any  time  and  can  move  in  different 
directions within the field of view. Unlike other typical tracking problems with a relatively static
background  and  moving  objects,  we  can  have  four  combinations  of  static  and  moving  text  and 
background  if  we  treat  all  non-text  as  background.  It  would  be  very  difficult,  if  not  impossible, 
to  detect  and  track  text  simply  by  motion  analysis.  Although  in  some  special  cases  such  as 
news  or  movie  credits,  we  can  make  use  of  motion  information  to  segment  text,  in  general, 
motion  information  alone  is  not  sufficient.  A  text  detection  scheme  is  required  in  conjunction 
with  tracking. 

A very simplistic solution to text tracking would be to perform the text detection frame by
frame and then match the corresponding text blocks between consecutive frames. Considering
that there may be close to 30 frames per second, running full text detection on every frame
without using temporal context would be prohibitively expensive. We can, however, make use of
the fact that text remains in the scene for a number of consecutive frames to reduce the
complexity by performing the text detection process periodically and focusing on the tracking
process (Figure 1).
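The division of labor between periodic detection and per-frame tracking can be sketched as a simple schedule. This is only an illustration: the function name `plan_processing` and the period `detect_every=30` are assumptions made for the example, not values taken from our system.

```python
def plan_processing(num_frames, detect_every=30):
    """Label each frame with the (hypothetical) stage that would run on it.

    Full text detection runs on every `detect_every`-th frame; all other
    frames are handled by the cheaper tracking stage.
    """
    plan = []
    for frame_index in range(num_frames):
        if frame_index % detect_every == 0:
            plan.append("detect")
        else:
            plan.append("track")
    return plan


# Three seconds of video at 30 frames per second.
plan = plan_processing(num_frames=90, detect_every=30)
print(plan.count("detect"), plan.count("track"))
```

With these assumed settings the expensive detector runs on 3 of the 90 frames, and the remaining 87 frames are covered by tracking.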

Our text detection scheme is based on the observation that text regions typically have
different texture properties than the surrounding areas. This texture has similar frequency and
orientation information, making wavelets a reasonable candidate for representation. We use a
hybrid wavelet/neural network segmenter (Figure 2) to segment text regions. To facilitate the
detection of various text sizes, we use a pyramid of images generated from the original image by
halving the resolution at each level. The extracted text regions are hypothesized at each level
and then extrapolated to the original scale.

Figure 1: The scheme for text detection and tracking in digital video.

Figure 2: The architecture for detecting text in video frames.

Once the text is detected, a multi-resolution SSD (Sum of Squared Differences) based tracking
module is started to track the detected text. We make use of the text contour to stabilize
the tracking so text can be tracked more robustly.
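To make the SSD criterion concrete, the following sketch searches a small neighborhood in the next frame for the position that minimizes the sum of squared differences against the text block from the previous frame. It is a single-resolution simplification: the multi-resolution search and the contour-based stabilization are omitted, and the names `ssd`, `track_ssd` and `search_radius` are illustrative assumptions.

```python
import numpy as np


def ssd(a, b):
    """Sum of squared differences between two equally sized blocks."""
    d = a.astype(np.float64) - b.astype(np.float64)
    return float(np.sum(d * d))


def track_ssd(prev_frame, next_frame, box, search_radius=4):
    """Find the best match for box = (top, left, height, width) in next_frame.

    Every displacement within +/- search_radius pixels is tried, and the
    position with the smallest SSD is returned together with its score.
    """
    top, left, h, w = box
    template = prev_frame[top:top + h, left:left + w]
    best_score, best_pos = None, (top, left)
    for dy in range(-search_radius, search_radius + 1):
        for dx in range(-search_radius, search_radius + 1):
            t, l = top + dy, left + dx
            # Skip candidate windows that fall outside the frame.
            if t < 0 or l < 0 or t + h > next_frame.shape[0] or l + w > next_frame.shape[1]:
                continue
            score = ssd(template, next_frame[t:t + h, l:l + w])
            if best_score is None or score < best_score:
                best_score, best_pos = score, (t, l)
    return best_pos, best_score


# Synthetic test: the whole frame moves 2 pixels down and 3 pixels right.
rng = np.random.default_rng(0)
prev_frame = rng.integers(0, 256, size=(40, 60))
next_frame = np.roll(prev_frame, shift=(2, 3), axis=(0, 1))
position, score = track_ssd(prev_frame, next_frame, box=(10, 20, 8, 16))
print(position, score)
```

In this synthetic case the block that started at row 10, column 20 is recovered at row 12, column 23 with an SSD of zero, because the shifted content matches exactly.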

The remainder of this paper is organized as follows. We describe our text detection scheme
in Section 3 and our text tracking scheme in Section 4. Experimental results are presented in
Section 5 and in Section 6 we discuss some immediate applications of our work.
Figure 3: (a) A video frame, (b) binarization by manually picking best thresholds.

3  Text  Detection  in  Video  Frames 

Compared with other text detection problems, text detection in digital video has several new
challenges. First, text in digital video is usually embedded in a complex background, which
makes detection difficult for the connected component (CC)-based methods. Figure 3a shows a typical
video frame and Figure 3b is the binarized version using a good threshold. The binarized
image shows connectedness between character components and the background. Second, text
in digital video is rendered at low resolution. For document images, scans of 300 dots per inch
are common, which results in "normal" characters occupying an area as large as 50 × 50 pixels.
Video frames are usually limited in spatial resolution to as low as 352 × 240 with a character
size of approximately 10 × 10 pixels. The effect of low resolution is two-fold: First, the text is
often aliased so color clustering may result in scattered text. Second, text components tend to
touch each other, making detection difficult.

In some video frames, natural scenes like the leaves of a tree or grass in a field have textures
similar to text. In feature space, text and nontext often overlap. To illustrate this point,
we collected 500 text blocks and 500 nontext blocks and used Linear Discriminant Analysis
(LDA) [17] to project them to a new subspace where the between-class scatter is maximized
and the within-class scatter is minimized, which benefits the classification. The projection
result (Figure 4) shows that text and nontext still overlap in the feature space. Therefore, we
think that using supervised learning to classify text and nontext will be more effective than the
unsupervised clustering techniques used in [25, 56]. An artificial neural network is a natural
choice as a classifier because of its ability to learn. Theoretically, a three-layer neural network
with enough hidden units can approximate any continuous nonlinear function after training. The
success of neural networks in related problems [8, 11, 13, 39, 48] provides us with further
motivation to rely on a neural network as a classifier to identify text regions.

Figure 4: The distribution of text (o) and nontext (+) in the subspace after LDA projection.
Text and nontext overlap.
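The LDA projection used for this illustration can be sketched for the two-class case, where Fisher's criterion yields a single direction proportional to $S_w^{-1}(m_1 - m_0)$, with $S_w$ the within-class scatter matrix and $m_0$, $m_1$ the class means. The function name `fisher_direction` and the toy data are assumptions for the example; they are not the 500 + 500 blocks used in our experiment.

```python
import numpy as np


def fisher_direction(class0, class1):
    """Unit vector maximizing between-class over within-class scatter."""
    x0 = np.asarray(class0, dtype=np.float64)
    x1 = np.asarray(class1, dtype=np.float64)
    m0, m1 = x0.mean(axis=0), x1.mean(axis=0)
    # Within-class scatter: sum of the scatter matrices of both classes.
    s_w = (x0 - m0).T @ (x0 - m0) + (x1 - m1).T @ (x1 - m1)
    w = np.linalg.solve(s_w, m1 - m0)
    return w / np.linalg.norm(w)


# Toy data: two unit squares separated along the first feature axis.
nontext = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
text = nontext + np.array([4, 0])
w = fisher_direction(nontext, text)
print(w, nontext @ w, text @ w)
```

Here the direction comes out along the first axis, and the projected classes do not overlap. With real text and nontext features the projected distributions overlap, which is the point made by Figure 4.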

Our methodology uses a small window (typically 16 × 16) to scan the image and classify each
window as text or non-text using the neural network (Figure 2). We will address the following
issues related to the problem:

•  Scale-space decomposition of the image,

•  Feature extraction and selection,

•  Classification using the neural network,

•  Text identification.

3.1  Scale  Space  Decomposition 

Analysis of scale space provides a method of identifying the spatial frequency content in local
regions within the image. We use wavelets to decompose the image because they provide
successive approximations to the image by downsampling and have the ability to detect edges
during the high-pass filtering. The low-pass filter creates successive approximations to the image
while the detailed signal provides a feature-rich representation in terms of textual content [38].
This is easily seen in the image decomposition shown in Figure 5 where 5a is the original image
and 5b is its first-level wavelet decomposition. Note that the text region shows high activity
in the three high-frequency subbands (HL, LH, HH). As a result of their local nature, only
wavelets which are located on or near the edge yield large wavelet coefficients, making text
regions detectable in the high-frequency subbands. High-frequency components are important
for recognition. Jones and Palmer have shown that wavelets can closely represent the response
profiles of neurons in the striate cortex [27].

Figure 5: Single-level wavelet decomposition of a video frame: (a) Original image, (b) Decomposed
images.

Depending on the application, the choice of suitable wavelets and the corresponding decomposition
level are based on different criteria. In our case, Haar wavelets were chosen since
they are adequate for local detection of line segments, which characterize text regions. In addition
to their simplicity, image decomposition can be implemented with simple mask operations,
i.e., weighted sums and comparisons, making Haar wavelets computationally efficient. Furthermore,
little or no improvement was observed when using other types of wavelets, based on our
experiments.

The scaling and wavelet functions of Haar wavelets can be written as

$$\phi(x) = \sum_{k \in \mathbb{Z}} p_k \, \phi(2x - k) = \phi(2x) + \phi(2x - 1) \qquad (1)$$

$$\psi(x) = \sum_{k \in \mathbb{Z}} q_k \, \phi(2x - k) = \phi(2x) - \phi(2x - 1) \qquad (2)$$

respectively, with

$$\phi(x) = \begin{cases} 1 & \text{for } 0 \le x < 1 \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

The two-scale sequences $p_k$ in Equation 1 have non-zero values $p_0 = p_1 = 1$ and zero values for
all other $p_k$. $q_k$ in Equation 2 is zero except for $q_0 = 1$ and $q_1 = -1$.

Figure 6: Haar masks. (a) Lowest frequencies, (b) Vertical high frequencies, (c) Horizontal high
frequencies, (d) High frequencies in horizontal and vertical directions.

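For an image $I(x, y)$ represented as

$$I(x, y) = \begin{bmatrix}
i_{0,0} & i_{0,1} & \cdots & i_{0,2N-1} \\
i_{1,0} & i_{1,1} & \cdots & i_{1,2N-1} \\
\vdots & \vdots & \ddots & \vdots \\
i_{2N-1,0} & i_{2N-1,1} & \cdots & i_{2N-1,2N-1}
\end{bmatrix}_{2N \times 2N} \qquad (4)$$

we follow [37] to obtain the two-dimensional Haar decomposition of $I(x, y)$:

$$LL_{x,y} = \frac{1}{4} \sum_{k_1, k_2 = 0}^{1} p_{k_1} p_{k_2} \, i_{k_1 + 2x,\, k_2 + 2y} = \frac{1}{4} \left( i_{2x,2y} + i_{2x,2y+1} + i_{2x+1,2y} + i_{2x+1,2y+1} \right) \qquad (5)$$

$$LH_{x,y} = \frac{1}{4} \sum_{k_1, k_2 = 0}^{1} p_{k_1} q_{k_2} \, i_{k_1 + 2x,\, k_2 + 2y} = \frac{1}{4} \left( i_{2x,2y} - i_{2x,2y+1} + i_{2x+1,2y} - i_{2x+1,2y+1} \right) \qquad (6)$$

$$HL_{x,y} = \frac{1}{4} \sum_{k_1, k_2 = 0}^{1} q_{k_1} p_{k_2} \, i_{k_1 + 2x,\, k_2 + 2y} = \frac{1}{4} \left( i_{2x,2y} + i_{2x,2y+1} - i_{2x+1,2y} - i_{2x+1,2y+1} \right) \qquad (7)$$

$$HH_{x,y} = \frac{1}{4} \sum_{k_1, k_2 = 0}^{1} q_{k_1} q_{k_2} \, i_{k_1 + 2x,\, k_2 + 2y} = \frac{1}{4} \left( i_{2x,2y} - i_{2x,2y+1} - i_{2x+1,2y} + i_{2x+1,2y+1} \right) \qquad (8)$$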

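The image decomposition process is equivalent to the convolution of the image I(x, y) with the
kernels in Figure 6, evaluated at every second pixel in each direction. Each coefficient requires
only three additions or subtractions and one scaling, so the decomposition is computationally
efficient.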
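A single decomposition level following Equations 5 to 8 can be sketched with array slicing. The function name `haar_level` is an assumption for the example, and the image is assumed to have even height and width; the first array index plays the role of $x$ and the second the role of $y$.

```python
import numpy as np


def haar_level(image):
    """One level of 2-D Haar decomposition; returns (LL, LH, HL, HH)."""
    i = np.asarray(image, dtype=np.float64)
    a = i[0::2, 0::2]  # i_{2x, 2y}
    b = i[0::2, 1::2]  # i_{2x, 2y+1}
    c = i[1::2, 0::2]  # i_{2x+1, 2y}
    d = i[1::2, 1::2]  # i_{2x+1, 2y+1}
    ll = (a + b + c + d) / 4.0
    lh = (a - b + c - d) / 4.0
    hl = (a + b - c - d) / 4.0
    hh = (a - b - c + d) / 4.0
    return ll, lh, hl, hh


# A single 2 x 2 block gives one coefficient per subband.
ll, lh, hl, hh = haar_level([[1, 2], [3, 4]])
print(ll[0, 0], lh[0, 0], hl[0, 0], hh[0, 0])  # 2.5 -0.5 -1.0 0.0
```

Applying `haar_level` again to the returned LL image gives the next decomposition level, each time halving the subband size.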
3.2  Feature  Extraction  and  Selection


The task of feature extraction and selection cannot be solved in closed form. Our methods of
feature extraction and selection are explained in this section.

3.2.1  Feature  Extraction 

For our purposes we extract features from the wavelet decomposition of the image since (as
shown in Figure 5) text has a different texture from the background in the subbands and is detectable.
We use as features the mean and the second- and third-order central moments. For an $N \times N$
subblock $I$, the mean and the second- and third-order central moments can be written as

$$E(I) = \frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} I(i, j) \qquad (9)$$

$$\mu_2(I) = \frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left( I(i, j) - E(I) \right)^2 \qquad (10)$$

$$\mu_3(I) = \frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left( I(i, j) - E(I) \right)^3 \qquad (11)$$

All the features are computed on decomposed subband images. Since the original window size
is 16 × 16, the maximum decomposition level we can choose is 4, and only one pixel is left for
each subband image on the fourth level.
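The three features, and the subband sizes that limit the decomposition depth, can be sketched as follows. The function name `moment_features` is an assumption for the example; the moments are the mean and the second- and third-order central moments of one subband block.

```python
import numpy as np


def moment_features(block):
    """Return (mean, second central moment, third central moment) of a block."""
    b = np.asarray(block, dtype=np.float64)
    mean = b.mean()
    centered = b - mean
    mu2 = np.mean(centered ** 2)
    mu3 = np.mean(centered ** 3)
    return float(mean), float(mu2), float(mu3)


# Worked example: deviations from the mean 1 are -1, -1, -1 and 3.
print(moment_features([[0, 0], [0, 4]]))  # (1.0, 3.0, 6.0)

# Subband side length for a 16 x 16 window at decomposition levels 1 to 4.
print([16 // 2 ** level for level in range(1, 5)])  # [8, 4, 2, 1]
```

The second line of output shows why level 4 is the limit for a 16 × 16 window: each subband has shrunk to a single pixel, so its central moments are identically zero.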

3.2.2  Feature  Saliency 

Feature saliency is defined as a measure of a feature's ability to impact classification. A benchmark
to compare feature saliency is a probability of error criterion, denoted by $P_e$, which
computes the probability of error when only a single feature is used. Another saliency metric
for features measures the MLP outputs, sampled over an appropriate range of allowable input
values [50]. Empirical results indicate that this metric provides similar rankings to [48].

We use Bayes error rate as a measure of saliency. The Bayes error rate can be computed (or
estimated) to determine whether or not the feature will yield adequate separation between the
classes. In most practical cases, the Bayes error rate is estimated using a finite set of labelled
samples from the various classes. This is typically done by estimating the posterior probability
of each class for each sample, then assigning each sample to the class with the MAP (maximum
a posteriori) probability estimate. The percentage of samples misclassified by applying the
MAP decision rule to the posterior estimates is taken as an estimate of the Bayes error rate.

Decomposition   Subband   P_e for    P_e for    P_e for
level           image     E(I)       μ2(I)      μ3(I)
1               LL        0.3600     0.3000     0.3325
1               HL        0.4519     0.2256     0.2400
1               LH        0.4281     0.2256     0.2344
1               HH        0.3969     0.2387     0.2356
2               LL        0.3600     0.3750     0.3612
2               HL        0.4169     0.2131     0.2087
2               LH        0.4050     0.2481     0.2538
2               HH        0.3769     0.1987     0.1950
3               LL        0.3600     0.4669     0.4613
3               HL        0.4038     0.3750     0.3912
3               LH        0.4113     0.3025     0.2963
3               HH        0.3556     0.3069     0.2919

Table 1: Using Bayes error to measure feature saliency at different decomposition scales.
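The MAP-based estimate of the Bayes error for a single feature can be sketched with histograms. This is an assumption-laden illustration rather than our exact procedure: class posteriors are approximated by per-bin sample counts, the estimate is computed on the same samples used to build the histograms, and the names `bayes_error_estimate` and `bins` are hypothetical.

```python
import numpy as np


def bayes_error_estimate(text_values, nontext_values, bins=16):
    """Histogram estimate of the single-feature Bayes error.

    Within each bin the MAP rule picks the class with more samples, so the
    samples of the minority class in that bin are the misclassified ones.
    """
    text_values = np.asarray(text_values, dtype=np.float64)
    nontext_values = np.asarray(nontext_values, dtype=np.float64)
    both = np.concatenate([text_values, nontext_values])
    edges = np.linspace(both.min(), both.max(), bins + 1)
    text_counts, _ = np.histogram(text_values, bins=edges)
    nontext_counts, _ = np.histogram(nontext_values, bins=edges)
    errors = np.minimum(text_counts, nontext_counts).sum()
    return float(errors) / both.size


# One text sample falls among the nontext values and vice versa.
p_e = bayes_error_estimate([1.0, 1.1, 1.2, 0.1], [0.0, 0.1, 0.2, 1.1], bins=2)
print(p_e)  # 0.25
```

A value near 0.5 means the feature alone is little better than guessing for two equally sized classes, while a value near 0 indicates a highly salient feature.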

A Java-based interface was implemented to collect text samples and nontext samples. The
interface lets the user load images and then use simple mouse operations to label text
regions and nontext regions to collect samples. We collected 1000 text blocks and 1000 nontext
blocks and used 600 samples from each class for training and the remaining samples for testing.

Table 1 lists the $P_e$ of each feature. Three conclusions can be drawn from Table 1. First,
the features in the third-level decomposition have larger $P_e$ values than features in the first two
levels. The explanation is that the subband image size in the third level is 2 × 2, and therefore
provides less information. Second, on all levels, the $P_e$ values of the mean are generally larger
than those of the second- and third-order moments. Finally, for the second- and third-order
moments, $P_e$ in the high-frequency subband images (HL, LH, HH) is smaller than in the
low-frequency subband image (LL) on the same decomposition level. This implies that we should
use the second- and third-order moments calculated from the high-frequency subband images
on the first two decomposition levels.

In order to verify the above conclusions, we tested the data on other wavelets. We ignore the
third level here since it is unlikely the third level can provide additional discrimination. Table 2
is the result when we use Daubechies-2 wavelets and Table 3 is the result when we use Coiflets-3
wavelets. Although there exist small variations, we can still draw the same conclusions as in
the Haar wavelet case. By comparing these results with Table 1, we can see that the Haar
wavelets outperform the Daubechies and Coiflets wavelets for this problem.

Decomposition   Subband   P_e for       P_e for    P_e for 3rd-order
level           image     expectation   variance   moment
1               LL        0.3600        0.4181     0.4644
1               HL        0.4519        0.4137     0.4369
1               LH        0.4281        0.3588     0.3669
1               HH        0.3962        0.3350     0.3337
2               LL        0.3600        0.4537     0.4694
2               HL        0.3706        0.4344     0.4550
2               LH        0.3950        0.3725     0.3769
2               HH        0.3975        0.3825     0.3700

Table 2: The Bayes error when we use Daubechies-2 wavelets.

Decomposition   Subband   P_e for       P_e for    P_e for 3rd-order
level           image     expectation   variance   moment
1               LL        0.3600        0.4513     0.5019
1               HL        0.4519        0.4219     0.4300
1               LH        0.4281        0.3350     0.3225
1               HH        0.3969        0.3700     0.4087
2               LL        0.3600        0.3925     0.4281
2               HL        0.4206        0.3506     0.3550
2               LH        0.3931        0.3488     0.3600
2               HH        0.3794        0.2169     0.2225

Table 3: The Bayes error when we use Coiflets-3 wavelets.

3.2.3  Feature  Selection 

The criterion for choosing features is to keep the feature set as small as possible while still
achieving the best classification result. One obvious reason for a reduced feature set is the
efficiency of training with smaller numbers of parameters. Some authors have addressed the
impact on required training samples of the dimensionality of the input features [3, 9, 15]. The
dimensionality  curve  shows  that  the  number  of  training  samples  required  grows  exponentially 
with  the  number  of  features. 

The saliency analysis in the last section can help us choose a feature set. We rank the
twelve features in the order of increasing Bayes error $P_e$ as shown in Table 4. We use
$\mu_2^{j,i}$ and $\mu_3^{j,i}$ to denote the second- and third-order moments respectively, with $j$
representing the decomposition level and $i$ representing the subband.

Rank   Feature              P_e
1      $\mu_3^{2,HH}$       0.195
2      $\mu_2^{2,HH}$       0.198
3      $\mu_3^{2,HL}$       0.208
4      $\mu_2^{2,HL}$       0.213
5      $\mu_2^{1,HL}$       0.225
6      $\mu_2^{1,LH}$       0.225
7      $\mu_3^{1,LH}$       0.234
8      $\mu_3^{1,HH}$       0.235
9      $\mu_2^{1,HH}$       0.238
10     $\mu_3^{1,HL}$       0.240
11     $\mu_2^{2,LH}$       0.248
12     $\mu_3^{2,LH}$       0.253

Table 4: Ranking the features in the order of increasing $P_e$.

Our scheme for selecting features is to use the first feature ($\mu_3^{2,HH}$) in Table 4 to train and
classify and then calculate the corresponding classification error. We then add another feature
and use the first two features ($\mu_3^{2,HH}$ and $\mu_2^{2,HH}$) and repeat the process. At each step one more
feature is added until all twelve features are used.
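The incremental procedure can be sketched as a loop over prefixes of the ranked feature list. In this illustration a nearest-class-mean classifier stands in for the neural network purely to keep the example short; `nearest_mean_error`, `error_curve` and the toy data are assumptions and do not reproduce our experiments.

```python
import numpy as np


def nearest_mean_error(train_x, train_y, test_x, test_y):
    """Test error of a nearest-class-mean classifier (stand-in classifier)."""
    mean0 = train_x[train_y == 0].mean(axis=0)
    mean1 = train_x[train_y == 1].mean(axis=0)
    d0 = np.sum((test_x - mean0) ** 2, axis=1)
    d1 = np.sum((test_x - mean1) ** 2, axis=1)
    predicted = (d1 < d0).astype(int)
    return float(np.mean(predicted != test_y))


def error_curve(ranked_columns, train_x, train_y, test_x, test_y,
                evaluate=nearest_mean_error):
    """Classification error when using the first 1, 2, ..., n ranked features."""
    errors = []
    for count in range(1, len(ranked_columns) + 1):
        cols = ranked_columns[:count]
        errors.append(evaluate(train_x[:, cols], train_y, test_x[:, cols], test_y))
    return errors


# Toy data: column 0 separates the classes, column 1 is constant.
train_x = np.array([[0.0, 5.0], [1.0, 5.0], [9.0, 5.0], [10.0, 5.0]])
train_y = np.array([0, 0, 1, 1])
test_x = np.array([[2.0, 5.0], [8.0, 5.0]])
test_y = np.array([0, 1])
print(error_curve([0, 1], train_x, train_y, test_x, test_y))  # [0.0, 0.0]
```

Plotting the returned list against the number of features gives a curve of the kind used to decide where adding further features stops paying for its processing cost.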

One limitation of the Bayes method is its inability to deal with high-dimensional data. We
therefore use a neural network instead of the Bayes classifier used in the analysis of saliency.
By changing the number of input features, the parameters of the neural network (the numbers
of input and hidden-layer nodes) also need to change. For each group of inputs, we adjust
the number of hidden-layer nodes, so the minimal classification error can be reached. Figure
7 illustrates the relation between the classification error and the number of features used.
Although the best classification result is achieved when all twelve features are used, we have
dropped the last four features after considering the balance between accuracy and processing
time (classification is conducted for each 16 × 16 window in each video frame and accounts for
the principal processing time for text detection). The first eight features in Table 4 are selected
for classification and the corresponding configuration of the neural network consists of three
layers with 8 input nodes, 12 hidden nodes and 1 output node.

Figure 7: Test set classification error as the number of features is increased.

3.3  Training  the  Neural  Network

After selecting the features, we re-train the neural network. The samples we collected in feature
analysis may not be enough to cover the feature space extensively. Although it is easy to get
representative samples of text, it is more difficult to get representative samples of non-text
since non-text spans a vast space. To deal with this we use a bootstrap method recommended
by Sung and Poggio [54] to re-train. The idea is that the training samples are collected during
training instead of before training as follows:

1.  Create  an  initial  set  of  training  samples  which  includes  a  complete  set  of  text  samples 
and  a  portion  of  non-text  samples. 

2.  Train  the  network  on  these  samples. 

3.  Run  the  system  on  a  video  frame  which  contains  no  text  and  add  image  blocks  which  the 
network  incorrectly  classifies  as  text  to  the  non-text  sample  set. 

4.  Repeat  steps  2  and  3  until  the  accuracy  converges. 
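The four steps above can be sketched as a loop in which false positives from text-free frames are fed back into the non-text set. To stay self-contained, this illustration replaces the neural network with a toy one-dimensional threshold classifier, and it treats "no new false positives" as the convergence test; `train_threshold`, `bootstrap_train` and the sample values are assumptions for the example.

```python
def train_threshold(text_samples, nontext_samples):
    """Toy stand-in for network training: a threshold between the classes."""
    return (min(text_samples) + max(nontext_samples)) / 2.0


def bootstrap_train(text_samples, initial_nontext, text_free_blocks, max_rounds=10):
    """Retrain repeatedly, adding false positives to the non-text set."""
    nontext = list(initial_nontext)                      # step 1
    threshold = train_threshold(text_samples, nontext)   # step 2
    for _ in range(max_rounds):
        # Step 3: blocks from text-free frames that are wrongly called text.
        false_positives = [b for b in text_free_blocks
                           if b > threshold and b not in nontext]
        if not false_positives:                          # step 4: converged
            break
        nontext.extend(false_positives)
        threshold = train_threshold(text_samples, nontext)
    return threshold, nontext


threshold, nontext = bootstrap_train(
    text_samples=[9, 10], initial_nontext=[1], text_free_blocks=[1, 2, 6, 7])
print(threshold, nontext)  # 8.0 [1, 6, 7]
```

In this toy run the initial threshold of 5.0 misclassifies the blocks with values 6 and 7; after they are added to the non-text set the threshold moves to 8.0 and no further false positives are found.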

After training, the neural network maps each block to a real value between 0 (non-text) and 1
(text). Figure 8a shows the distribution of the neural network outputs
when testing on text and non-text blocks after training. A threshold of 0.5 is a reasonable choice
for deciding whether the window contains text. We can see that the text and non-text
classes are well separated when the neural network is used as a classifier.
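As a rough illustration of the classifier's shape, the sketch below runs a forward pass through a three-layer network with 8 inputs, 12 hidden nodes and 1 output, then applies the 0.5 threshold. The sigmoid activation and the all-zero placeholder weights are assumptions made for the example; the paper does not give the trained weights or the activation function in this section.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(features, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of an 8-12-1 network; returns a value in (0, 1)."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


def is_text(output, threshold=0.5):
    """Decision rule: outputs at or above the threshold are labeled text."""
    return output >= threshold


# Placeholder weights (all zeros), only to show the shapes: 12 x 8 and 1 x 12.
w_hidden = [[0.0] * 8 for _ in range(12)]
b_hidden = [0.0] * 12
w_out = [0.0] * 12
b_out = 0.0
out = forward([0.0] * 8, w_hidden, b_hidden, w_out, b_out)
```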

3.4  Text  Identification 

Figure 9: (a) Original frame, (b) The classified label map, (c) The segmented text area corresponding
to (b), (d) The segmented text area after post-processing and bounding box generation.

After training the neural network, we scan the video frame with a 16 × 16 window and classify
each window as text or non-text. The larger the step by which the window moves, the fewer
windows need to be processed, but the coarser the result. For a video frame of size
352 × 240, the system needs to classify 75825 windows if the window moves 1 pixel at a time;
this takes more than 4 seconds on a Sun Ultra I. On the other hand, with a step size of 16 pixels,
where the windows do not overlap, a text line may be split between windows and may not be
detected. Considering the trade-off between precision and speed, we move the window 4 pixels
at a time.
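The window counts behind this trade-off can be checked with a few lines of arithmetic. The helper name `window_count` is ours; it simply counts the positions a 16 × 16 window can take in a 352 × 240 frame for a given step.

```python
def window_count(width, height, win=16, step=1):
    """Number of win x win window positions visited with the given step."""
    nx = (width - win) // step + 1
    ny = (height - win) // step + 1
    return nx * ny


for step in (1, 4, 16):
    print(step, window_count(352, 240, step=step))
```

A step of 1 gives 337 × 225 = 75825 windows, matching the figure quoted above; a step of 4 gives 85 × 57 = 4845, and a step of 16 gives 22 × 15 = 330.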

If a window is classified as text, all the pixels in this window are labeled as text. Those
pixels which are not covered by any text window are labeled 0. The result of the classification is
a label map of the original image. Figure 9(a) is a video frame and Figure 9(b) is the classified
label map corresponding to Figure 9(a). Figure 9(c) shows the extracted text regions. We can
see that all of the text is labeled correctly, but there are some small isolated areas which are
incorrectly labeled as text. We use size consistency between blocks to filter out these areas.
The bounding box of the text area is generated by connected component analysis of the text
windows. Figure 9(d) is the result after we filter out the non-text areas and generate the
bounding box.
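A minimal sketch of the label map and bounding-box step is given below. It assumes the classifier output is available as a list of top-left corners of windows labeled as text, and it uses 4-connectivity for the component analysis; both are choices made for the illustration. The size-consistency filter is not shown because its exact criterion is not specified here.

```python
from collections import deque


def label_map(width, height, text_windows, win=16):
    """Mark every pixel covered by a text window with 1, all others with 0."""
    labels = [[0] * width for _ in range(height)]
    for x0, y0 in text_windows:               # top-left corners of text windows
        for y in range(y0, min(y0 + win, height)):
            for x in range(x0, min(x0 + win, width)):
                labels[y][x] = 1
    return labels


def bounding_boxes(labels):
    """4-connected component analysis; returns (x1, y1, x2, y2) per component."""
    h, w = len(labels), len(labels[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if labels[sy][sx] != 1 or seen[sy][sx]:
                continue
            x1 = x2 = sx
            y1 = y2 = sy
            queue = deque([(sx, sy)])
            seen[sy][sx] = True
            while queue:
                x, y = queue.popleft()
                x1, x2 = min(x1, x), max(x2, x)
                y1, y2 = min(y1, y), max(y2, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < w and 0 <= ny < h
                            and labels[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x1, y1, x2, y2))
    return boxes


# Two overlapping text windows and one isolated window in a 64 x 48 frame.
labels = label_map(64, 48, [(0, 0), (4, 0), (40, 24)])
boxes = bounding_boxes(labels)
```

The two overlapping windows merge into one box spanning columns 0 to 19, while the isolated window yields its own 16 × 16 box.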


3.5  Result  Integration 

As depicted in Figure 2 we use a three-level pyramid approach to detect text with a wide range
of font sizes. After the detection step is applied at each level, the bounding boxes are mapped
back to the original input image. If the bounding boxes at one level do not overlap with the
boxes at other levels, we simply merge them into the original image. In some cases, however,
text strings or parts of text strings appear at multiple levels, so the boxes detected at different
levels overlap (Figure 10(a)). In this case we merge them by forming a bounding box which
contains both of them. Figure 10(b) is the integration result corresponding to Figure 10(a).
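The integration rule can be sketched as below. The factor of 2 between successive pyramid levels is an assumption of this example, and the function names are ours.

```python
def to_original(box, level):
    """Map a box found at pyramid level `level` back to level 0 (assumed factor 2 per level)."""
    s = 2 ** level
    x1, y1, x2, y2 = box
    return (x1 * s, y1 * s, x2 * s, y2 * s)


def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def integrate(boxes_per_level):
    """boxes_per_level[k] holds the boxes detected at pyramid level k."""
    merged = []
    for level, boxes in enumerate(boxes_per_level):
        for box in boxes:
            box = to_original(box, level)
            # Absorb every already-merged box that overlaps the new one; repeat,
            # because the grown box may now touch further boxes.
            changed = True
            while changed:
                changed = False
                for other in merged:
                    if overlaps(box, other):
                        box = union(box, other)
                        merged.remove(other)
                        changed = True
                        break
            merged.append(box)
    return merged


result = integrate([[(10, 10, 50, 20), (200, 100, 260, 112)],   # level 0
                    [(20, 8, 40, 12)],                          # level 1
                    []])                                        # level 2
```

Here the level-1 box maps to (40, 16, 80, 24) in the original image, overlaps the first level-0 box, and is replaced by their common bounding box; the second level-0 box is kept unchanged.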
Figure 10: (a) The text regions detected at different levels, (b) Integrated result.

3.6  Postprocessing 

The result of our text detection algorithm is a set of text blocks, which may contain multiple
text lines if the text is spatially compact (Figure 11a). For the text detection task, we can
keep the result in the form of blocks since this benefits image resolution enhancement [34]
and OCR (multiple text lines can provide more context information). However, these blocks
are not typically good candidates for tracking because of their size. Furthermore, different text
lines in the same text block may move at different speeds, or some text may remain static while
other text moves. For these reasons, we need to divide the text blocks into more compact text
elements, typically a line or a word.

We use the projection profile to extract text elements from text blocks. To avoid the
difficulty of distinguishing normal text from reverse text, we use the projection profile of the Canny edge
map of the image instead of the image itself. Figure 11d shows the set of generated text elements
corresponding to the detection result in Figure 11a. Blocks which do not have a horizontal
orientation will be tracked as a whole.
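A simple way to picture the projection-profile split is shown below. The sketch assumes a binary edge map (for instance a Canny edge map computed elsewhere) is already available as rows of 0/1 values; the `min_count` parameter is a hypothetical tuning knob, not a value from the paper.

```python
def split_lines(edge_map, min_count=1):
    """Split a text block into lines using the horizontal projection profile.

    edge_map is a list of rows of 0/1 edge pixels. Returns inclusive
    (top, bottom) row ranges, one per text line.
    """
    profile = [sum(row) for row in edge_map]      # edge pixels per row
    lines, start = [], None
    for y, count in enumerate(profile):
        if count >= min_count and start is None:
            start = y                             # entering a text line
        elif count < min_count and start is not None:
            lines.append((start, y - 1))          # leaving a text line
            start = None
    if start is not None:                         # line touching the bottom edge
        lines.append((start, len(profile) - 1))
    return lines


block = [
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
lines = split_lines(block)
```

Rows 1 to 2 and row 4 contain edge pixels, so the block is split into two text elements separated by the empty row 3.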

4  Text  Tracking 

4.1  Problem  Overview 

Once text is detected, the tracking process is started. Text motion in digital video is of three
types: static; simple linear, rigid motion (for example, scrolling movie credits); and complex
nonlinear, nonrigid motion (for example, scene text zooming in and out, or rotation caused by
camera motion). For the first two cases, a simple image matching technique may track the text
well. For the third case, a more robust technique is required. Our goal is to design a scheme to
efficiently track text with both simple and complex motion.

Figure 11: (a) The result of text detection, (b) Canny edge map of detected text block, (c)
Horizontal projection profile of (b), (d) Generated text lines.

An arbitrary motion can encompass a wide range of possible motion transformations. A
general paradigm for estimating motion parameters would be extremely difficult, if not impossible,
to develop. In the literature researchers have proposed various approaches to deal with
different motion transformations, given certain restrictions imposed on the object's behavior.
Robust, reliable visual tracking of an object in a complex environment will require the integration
of several different visual modules, each using a different criterion and each employing
different assumptions about the incoming images. The modules will be selected so that they
complement each other: if one module fails, another can come to its aid.

We can treat a text line as a closed set S in the plane, where the bounding box of the
text line corresponds to the boundary of S. The boundary B and the interior I of S provide
complementary information: B determines the shape (and size) of S and I determines its
content (intensity or texture). We believe that the integration of tracking modules based on
image content and shape will make the tracking process more powerful and stable.

We use a general SSD (Sum of Squared Differences) module to measure the similarity of
image content (corresponding to the interior I) and then make use of information about the
text contour to stabilize the position of the text block (corresponding to the boundary B). In
the rest of this section we give the details of the implementation.

4.2 SSD (Sum of Squared Differences) Based Image Matching

As either the camera or the text moves, the pattern of image intensities changes in complex
ways. We can describe these changes as image motion:

$$I(x, y, t + \tau) = I(x - \xi(x, y, t, \tau),\; y - \eta(x, y, t, \tau)) \qquad (12)$$

A subsequent image taken at time $t + \tau$ can be obtained by moving every point in the image
taken at time $t$ by an amount $\delta = (\xi, \eta)$, represented by an affine transform:

$$\delta = D\mathbf{x} + \mathbf{d} \qquad (13)$$

where $D$ is a 2 × 2 deformation matrix and $\mathbf{d}$ is a 2 × 1 displacement vector.

If we choose minimization of the SSD (Sum of Squared Differences) as the matching metric,
then the basic tracking algorithm can be described as follows: given a reference block in image
$I$ and a search region $W$ in image $J$, determine the six parameters of the deformation matrix
$D$ and displacement vector $\mathbf{d}$ which minimize

$$\epsilon = \iint_W \left[ J(D\mathbf{x} + \mathbf{d}) - I(\mathbf{x}) \right]^2 d\mathbf{x} \qquad (14)$$

The solution to Equation (14) can be very complex. The quality of the estimate depends
on many factors such as the size of the feature window, the texturedness of the image within
it, the amount of camera motion between frames, etc. Under the assumption of small inter-frame
motion, we can simplify the equation by setting the deformation matrix $D$ to the identity
matrix. Equation (14) then simplifies to

$$\epsilon = \iint_W \left[ J(\mathbf{x} + \mathbf{d}) - I(\mathbf{x}) \right]^2 d\mathbf{x} \qquad (15)$$

The model described by Equation (15) is a pure translational model. The search region $W$ is
within some range of the predicted location. We use a simple prediction technique to locate $W$
more accurately. Suppose $P_n$ and $P_{n-1}$ are the text positions in the current and previous frames; then
the predicted position in the next frame is

$$P_{n+1} = P_n + (P_n - P_{n-1})$$

In essence this is a second-order linear predictor. Although we could use more precise and
robust models such as high-order adaptive linear prediction or Kalman filtering, these would add
to the computational complexity.
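The translational matching step can be sketched as an exhaustive SSD search around a predicted position. The sketch assumes the predictor extrapolates the last displacement, P(n+1) = P(n) + (P(n) − P(n−1)), which is our reading of the second-order linear predictor; the search radius and function names are illustrative.

```python
def ssd(ref, image, x0, y0):
    """Sum of squared differences between ref and the same-sized patch of image at (x0, y0)."""
    return sum((image[y0 + y][x0 + x] - ref[y][x]) ** 2
               for y in range(len(ref)) for x in range(len(ref[0])))


def predict(p_curr, p_prev):
    """Constant-velocity prediction: P(n+1) = P(n) + (P(n) - P(n-1))."""
    return (2 * p_curr[0] - p_prev[0], 2 * p_curr[1] - p_prev[1])


def track(ref, image, p_curr, p_prev, radius=2):
    """Search a (2*radius+1)^2 neighborhood of the predicted position for the minimum SSD."""
    px, py = predict(p_curr, p_prev)
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x0, y0 = px + dx, py + dy
            if x0 < 0 or y0 < 0 or y0 + h > len(image) or x0 + w > len(image[0]):
                continue                          # candidate falls outside the frame
            score = ssd(ref, image, x0, y0)
            if best is None or score < best[0]:
                best = (score, (x0, y0))
    return best


# Toy frame: a 2 x 2 bright block whose top-left corner is at (5, 4).
frame = [[0] * 10 for _ in range(8)]
for y in (4, 5):
    for x in (5, 6):
        frame[y][x] = 9
ref = [[9, 9], [9, 9]]
score, pos = track(ref, frame, p_curr=(3, 3), p_prev=(2, 3))
```

With previous positions (2, 3) and (3, 3) the predicted position is (4, 3); the search around it finds the block at (5, 4) with an SSD of zero.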
Figure 12: Using an SSD-based model to track a movie credit. The initial position is specified
manually. (a) Tracking result, (b) Graph of SSD in consecutive frames (horizontal axis: frame number). The SSD is noticeably higher
in frames 95-105 since the text line moves onto a complex background.

The  pure  translational  model  tracks  well  in  the  cases  of  simple  motion  or  static  text.  Figure 
12a  shows  an  example  of  tracking  movie  credits.  Although  the  text  line  being  tracked  moves 
from  a  clean  background  to  a  complex  background  (in  the  middle  of  the  frame),  the  text  line 
is  correctly  tracked.  Figure  12b  shows  the  graph  of  the  SSD. 

Evidently, the pure translational model is not adequate to handle scale, rotation and perspective
distortions [21, 51]. While it is computationally efficient, over a long sequence the error
will accumulate to the point that the tracker loses the target [51]. Figure 13 shows the tracking
result when we track a text line in a conference video sequence. Both the camera and the scene change,
and the text moves in a complex way that includes zooming and rotation. Although the SSD is
small in consecutive frames (Figure 13b), the tracker gradually loses the target after
several frames.

4.3  Contour-Based  Text  Stabilization 

When text undergoes complex motion, the SSD-based pure translational model does not track it
well. More elaborate tracking models exist, including parametrized models for articulation
[7, 45], nonrigid deformations [6, 46] and linear image subspaces [5, 40]. However, the resulting
algorithms rely on nonlinear optimization techniques which require from several seconds to
several minutes per frame to compute. Instead, we use the text contour to stabilize the tracking
process.

Figure 13: SSD-based methods will fail when there are distortions. In this conference video,
there exist scale change and rotation. Panel (a): Frame 850, Frame 860, Frame 870.

The stabilization process can be implemented efficiently in the following way:

1. For the matched text position s = (x1, y1, x2, y2), generate a slightly larger text block
(x1 − δ, y1 − δ, x2 + δ, y2 + δ). The real text position will be included within this enlarged block.

2. Generate the edge map of the enlarged block by calculating the Canny edges. We use the edge map
instead of thresholding the image to avoid the difficulty of distinguishing normal text from
inverse text.

3. Apply a horizontal smearing process so that the edge pixels are grouped to form a text block.

4. Extract the connected components and their positions (x1', y1', x2', y2') to represent the
refined text position.

Figure 14a shows the initial matching position and Figure 14e shows the refined position.
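The refinement steps can be illustrated with the sketch below. It assumes the edge map of the enlarged block has already been computed (for example by a Canny detector from an image library) and is passed in as rows of 0/1 values. Taking the largest 4-connected component as the text contour and the `max_gap` smearing threshold are assumptions of this example, not details given in the paper.

```python
def enlarge(box, delta, width, height):
    """Step 1: grow the matched box by delta on every side, clipped to the frame."""
    x1, y1, x2, y2 = box
    return (max(x1 - delta, 0), max(y1 - delta, 0),
            min(x2 + delta, width - 1), min(y2 + delta, height - 1))


def smear_rows(edges, max_gap=3):
    """Step 3: fill runs of at most max_gap zeros lying between two edge pixels of a row."""
    out = [row[:] for row in edges]
    for row in out:
        last = None                               # column of the previous edge pixel
        for x, v in enumerate(row):
            if v:
                if last is not None and 0 < x - last - 1 <= max_gap:
                    for k in range(last + 1, x):
                        row[k] = 1
                last = x
    return out


def largest_component_box(mask):
    """Step 4: bounding box of the largest 4-connected component, or None if empty."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best_size, best_box = 0, None
    for sy in range(h):
        for sx in range(w):
            if not mask[sy][sx] or seen[sy][sx]:
                continue
            stack, size = [(sx, sy)], 0
            seen[sy][sx] = True
            x1 = x2 = sx
            y1 = y2 = sy
            while stack:
                x, y = stack.pop()
                size += 1
                x1, x2, y1, y2 = min(x1, x), max(x2, x), min(y1, y), max(y2, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if 0 <= nx < w and 0 <= ny < h and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((nx, ny))
            if size > best_size:
                best_size, best_box = size, (x1, y1, x2, y2)
    return best_box


def refine(box, edges_of_enlarged, delta, width, height, max_gap=3):
    """Return the refined box in frame coordinates (the input box if no contour is found)."""
    ex1, ey1, _, _ = enlarge(box, delta, width, height)
    local = largest_component_box(smear_rows(edges_of_enlarged, max_gap))
    if local is None:
        return box
    return (ex1 + local[0], ey1 + local[1], ex1 + local[2], ey1 + local[3])


# Edge map of the enlarged block (10 columns x 7 rows) for a matched box
# (4, 4, 9, 6) in a 20 x 12 frame with delta = 2.
edges = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
refined = refine((4, 4, 9, 6), edges, delta=2, width=20, height=12)
```

In this toy case the SSD match is off by one pixel in each direction, and the smeared edge map pulls the box to (5, 3, 10, 5).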
Figure 14: Text position refinement based on contour stabilization, (a) Initial matched text
position, (b) an enlarged text block, (c) edge map, (d) smeared image, (e) refined position.

The scheme described above works well when text is moving on a relatively clean background.
However, when text moves over a complex background, problems may occur. Over
a complex background, text may touch other graphical objects. Since our contour extraction
process operates on the slightly enlarged box (as is necessary since we must leave extra space
for contour extraction), the text contour will become larger even though the text is in linear,
rigid motion.

When the text moves on a complex background, the SSD between two consecutive frames
becomes larger since the pixel values in the background change considerably (Figure 12b). In
this case, the contour is harder to extract and contour stabilization cannot achieve a satisfactory
result, so we stop the stabilization process and depend only on the SSD
model. Once the text moves out of the complex background (the SSD becomes smaller again),
stabilization is resumed.

4.4 Using Multi-Resolution Matching to Reduce Complexity

Our SSD-based module is region-based and its computational cost is considerable when we
track a large text line. We perform matching from coarse to fine in a hierarchical fashion on
a Gaussian image pyramid. For a frame $I_t$ of size $w \times h$, a Gaussian pyramid is formed by
combining several reduced-resolution Gaussian images of frame $I_t$, where $t$ is the frame number
and $i \in \{0, 1, 2, \ldots, N\}$ represents the level in the pyramid. The size of the frame at level $i$ is
$\frac{w}{2^i} \times \frac{h}{2^i}$.

Figure 15: Multiple resolution based image matching, (a) Text block pyramid in Frame 650,
(b) Image pyramid formed with Frame 651. The matching process is conducted between images
of the corresponding scale.

The matching is conducted starting at the coarsest resolution (level $N$). Each level contributes
to determining the position of the match on the next level. The search for the minimum
SSD starts on level $N$ over a window of size $S = (2s + 1) \times (2s + 1)$. Suppose the matching
point found is $P^N(x, y)$. Then at level $N - 1$, the search for the minimum SSD will be conducted
around pixel $P^{N-1}(2x, 2y)$ over the same window size $S = (2s + 1) \times (2s + 1)$. This process
continues until the finest resolution (level 0) is reached. Although at each level the maximum
displacement supported by the search is $s$, which is much smaller than what is required in a
one-step search, the displacement is doubled after each level, so a displacement of $s$ found at
level $N$ corresponds to a displacement of $s \times 2^N$ at the finest level. Figure 15 illustrates this process. The number of pyramid levels depends
on the size of the text block. If the text line is small enough, no pyramid will be formed and
matching will be conducted directly at the original image scale.
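A compact sketch of the coarse-to-fine search follows. To stay self-contained it builds the pyramid by 2 × 2 averaging, which is a simplification swapped in for Gaussian filtering, and all function names are ours.

```python
def downsample(img):
    """Halve the resolution by 2 x 2 averaging (a stand-in for Gaussian reduction)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)] for y in range(h)]


def pyramid(img, levels):
    pyr = [img]
    for _ in range(levels):
        pyr.append(downsample(pyr[-1]))
    return pyr                                    # pyr[i] is level i


def ssd(ref, img, x0, y0):
    return sum((img[y0 + y][x0 + x] - ref[y][x]) ** 2
               for y in range(len(ref)) for x in range(len(ref[0])))


def search(ref, img, cx, cy, s):
    """Best top-left position within +/- s of (cx, cy); falls back to (cx, cy)."""
    best = (None, cx, cy)
    for y0 in range(cy - s, cy + s + 1):
        for x0 in range(cx - s, cx + s + 1):
            if x0 < 0 or y0 < 0 or y0 + len(ref) > len(img) or x0 + len(ref[0]) > len(img[0]):
                continue                          # candidate falls outside the frame
            score = ssd(ref, img, x0, y0)
            if best[0] is None or score < best[0]:
                best = (score, x0, y0)
    return best[1], best[2]


def coarse_to_fine(ref, img, start, levels, s=1):
    """Track ref in img starting from level `levels` (N) down to level 0."""
    ref_pyr, img_pyr = pyramid(ref, levels), pyramid(img, levels)
    x, y = start[0] // 2 ** levels, start[1] // 2 ** levels
    for level in range(levels, -1, -1):
        x, y = search(ref_pyr[level], img_pyr[level], x, y, s)
        if level > 0:
            x, y = 2 * x, 2 * y                   # seed the next finer level at (2x, 2y)
    return x, y


# An 8 x 8 textured block placed at (12, 8) in a 32 x 32 frame; the search
# starts from (8, 4), i.e. 4 pixels away in each direction.
ref = [[1 + (3 * x + 5 * y) % 7 for x in range(8)] for y in range(8)]
img = [[0] * 32 for _ in range(32)]
for y in range(8):
    for x in range(8):
        img[8 + y][12 + x] = ref[y][x]
found = coarse_to_fine(ref, img, start=(8, 4), levels=2, s=1)
```

With s = 1 a single-level search could only reach 1 pixel from the start, whereas the two-level pyramid recovers the 4-pixel displacement.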
Figure 16: Making use of temporal tendency.


4.5  Temporal  Tendency 

Although the goal of our system is to track text in the general case, the temporal tendency of
text motion can help facilitate the tracking process. If we find, for example, that the text lines are
scrolling up (or down), we can deduce that new text lines will appear at the bottom (or top)
of the video frame. We can then restrict the costly text detection process to a relatively small
region (Figure 16a).

When text crosses the frame horizontally, analysis of temporal tendency is necessary to
track text correctly. This often happens, for example, in newscasts when announcements cross
the bottom of the TV screen. We have no way to track the whole text line in this case since
the virtual text line is much bigger than the frame. In this case, we separate the text line into
words and track the words. If we find that all the words in a text line are moving horizontally
in the same direction and new words continue to appear on the same line, we can draw the
conclusion that the text is crossing, and we only need to monitor the small areas where new
text words may appear (Figure 16b).
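One way to turn the temporal tendency into a restricted detection region is sketched below. The band width, the minimum-motion threshold and the use of a full-height band for crossing text are assumptions of this example; the paper only states that detection can be limited to the area where new text is expected.

```python
def detection_region(displacements, width, height, band=32, min_motion=1):
    """Choose where to run text detection given per-line (dx, dy) displacements.

    Image coordinates are assumed to have y increasing downwards. Returns
    (x1, y1, x2, y2); the whole frame is returned when no common tendency exists.
    """
    full = (0, 0, width - 1, height - 1)
    if not displacements:
        return full
    if all(dy <= -min_motion for _, dy in displacements):     # scrolling up
        return (0, height - band, width - 1, height - 1)      # new text enters at the bottom
    if all(dy >= min_motion for _, dy in displacements):      # scrolling down
        return (0, 0, width - 1, band - 1)                    # new text enters at the top
    if all(dx <= -min_motion for dx, _ in displacements):     # crossing right to left
        return (width - band, 0, width - 1, height - 1)       # new words enter at the right
    if all(dx >= min_motion for dx, _ in displacements):      # crossing left to right
        return (0, 0, band - 1, height - 1)                   # new words enter at the left
    return full


region_up = detection_region([(0, -2), (0, -3)], 352, 240)
region_cross = detection_region([(-4, 0)], 352, 240)
region_mixed = detection_region([(0, -2), (0, 3)], 352, 240)
```

For credits scrolling up in a 352 × 240 frame, detection is confined to the bottom 32 rows; when the tracked lines disagree, the whole frame is searched.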

5  Implementation  and  Experiments 

We have implemented three tools based on the methods described above: a text detection
tool (TextDetect), a text tracking tool (TextTracker), and a text detection and tracking tool
(TextDT). All the programs are written in C and run on a Sun workstation running
Solaris 2.5. TextDetect detects text regions in a single video frame. TextTracker tracks a
text region whose initial position is specified. TextDT combines the two modules and conducts
detection and tracking in a sequence of frames. We evaluated the performance of these tools in
the following experiments.

Figure 17: The relation between SSE and training time.

5.1  Text  Detection 

We collected 200 video frames and split them into a training set (50 frames) and a testing
set (150 frames). The training samples (text blocks and non-text blocks) were collected from
video frames in the training set using the bootstrap method described above. The training time
depends on the required Sum-Squared Error (SSE). Figure 17 shows that the smaller the required SSE, the
longer the training of the neural network.

The detection procedure requires about 1 second on a Sun Ultra workstation to process a
352 × 240 frame with unoptimized code. Classification (including feature extraction) takes 0.5
seconds; postprocessing and image input and output take another 0.5 seconds.

The text collected includes both scene text and graphic text with multiple font sizes. We
evaluated our text detection system at the block level and did not segment text blocks into
words and characters. A text block may contain one or more text lines which are close to
each other. As shown in Table 5, there are a total of 283 text blocks in the 150 frames; 261
(92.4%) of them were correctly detected by our algorithm (Figure 18) and 22 (7.6%) of them
were missed. Errors occur primarily because of low resolution (Figure 19(a), the text block at
the bottom of the frame) or small text block size (Figure 19(b), text strings “31” and “34”).

Frames   Text blocks   Detected        Missed       False detections
150      283           261 (92.4%)     22 (7.6%)    23

Table 5: Text detection result

Figure 18: Part of the identification results, (a) Reverse text with large font size, (b) Text with
low resolution (bottom) or scene text (text on the T-shirt of an athlete), (c) Text with different font
sizes, (d) Text with different font styles.

On the other hand, 23 non-text blocks were misclassified as text blocks. Figures 19(c) and
(d) are two examples. Further training, domain-specific training, or attempting OCR should
overcome these problems.

With slight modifications, TextDetect could be used in other detection or segmentation
tasks. One direct application is document image segmentation. We change the number of
nodes in the output layer to 3, which corresponds to three classes (text, background and image).
The same feature extraction scheme is used and the number of hidden layer nodes is adjusted
to maximize classification performance. We chose a training set from the gray-scale documents
in the University of Washington database and tested the system on the rest of the documents in
the same database and on scanned magazine covers. Figure 20 shows that the segmentor works
well even when the layout of the document is complex.

Figure 19: Examples of misdetection: (a) A text block at the bottom of the frame is not
detected due to low resolution, (b) Text strings “31” and “34” are not detected due to the
small block size, (c) At the top of the frame a non-text block is misclassified, (d) In the middle
of the frame a non-text block is misclassified.

Figure 20: After slight modification, TextDetect can be used to segment document images
(white: text, black: picture, gray: background). (a) A document image scanned from a magazine
cover, (b) Segmentation result of (a), (c) A document image in the University of Washington
database, (d) Segmentation result of (c).


5.2  Text  Tracking 

We collected ten video sequences for experiments on text tracking from a wide variety of video
sources, including movie credits, TV programs, football games, news and conference videos.
A description of the video sequences, which included static text, simple motion, and complex
motion, is given in Table 6.

We conducted our experiments in two steps. First, we studied the performance of text
tracking by manually specifying the initial position of the text block, to filter out the possible
effects of text detection.

Unfortunately, quantitative evaluation of tracking accuracy is not easy because of the lack
of ground truth data. We output the tracking results in the form of MPEG video, which can be
viewed at http://documents.cfar.umd.edu/LAMP/Media/Projects/TextTrack/. Some
of the tracking results are shown in Figure 21.

Our experiments show that the tracker can work well when the text is in simple (rigid, linear)
motion (Figure 21c) or when the motion is complex but the background is clean (Figures 21a
and e). When the text moves arbitrarily on a complex background, the tracker may be confused
in some frames but will adjust its position once the text moves to a relatively clean background.
In Figure 21b, the initial position specified touches another object (the edge of the map). When
the text size increases, the tracking position deviates for some frames but adjusts when the text
grows large enough to be separated from the map. Another example tracks the number on an
athlete’s jersey in a football game. The task is complicated by the athlete’s running, jumping,
and rotating, as well as by camera motion (Figure 21d). The tracker loses part of the target in
some frames but then adjusts to the correct position.

Video ID   Video type     Number of frames   Text lines   Motion description
1          Movie Credit   2600               29           Complex
2          Movie Credit   2600               87           Scrolling
3          News           800                8            Static
4          Sports         200                1            Complex
5          Sports         200                2            Complex
6          Conference     800                4            Complex
7          Conference     800                1            Complex
8          Scene          228                7            Crossing
9          TV program     1200               1            Zooming in
10         Commercial     300                1            Crossing

Table 6: Video sources used in experiments

The average tracking time for one text block is about 0.2 seconds per frame. If the text line is
large enough, we can use multi-resolution matching to reduce this. By tracking text we can get
text blocks as well as temporal correspondences of the blocks. If we perform detection frame by
frame, extra time is required to find the correspondences of blocks between consecutive frames.

One application of our text tracking tool TextTracker is ground truthing of video data;
it is being used for this purpose in the Video Processing Evaluation Resource (ViPER) [12]
project in our lab. One of the functions ViPER provides is a mechanism to create ground
truth data (text, faces, etc.) and perform some quantitative evaluations. According to some
accounts it may cost as much as $40,000 to ground truth a 120-minute video manually. With
the tracking algorithm, we reduce the cost significantly. The user can specify the initial box
and then track the text through the rest of the frames. After tracking, the user can check and adjust the text
position if necessary.
Figure 21: Demonstration of TextTracker's performance on various video sources. The initial
position of the text block is manually specified. (a) Tracking of movie credits: Text
zooming with displacement, (b) Zooming scene text, (c) Text scrolling up across varying
background, (d) Tracking text on a football athlete’s jersey, (e) Tracking text in a conference
room scene. All of these and other tracking results can be found as MPEG video at
http://documents.cfar.umd.edu/LAMP/Media/Projects/TextTrack/
Figure 22: Text detection and tracking in the movie Star Wars. Panels include (e) Frame 652,
(f) Frame 974, (g) Frame 1380, (h) Frame 2000.

Figure 23: “Welcome to the Language and Media Processing Laboratory”, (a) Frame 30, (b)
frame 60, (c) frame 114, (d) frame 170.

In the second part of our experiments we combined the text detection and the tracking
modules. The input to TextDT is a sequence of images and the output is a sequence of detected
and tracked text regions represented as an MPEG video or an ASCII file which records the
information for all the text blocks.

Figure 22 shows a tracking result for the movie Star Wars. There are 2600 frames in the
sequence, which includes static, zooming, and scrolling text. Figure 23 shows tracking results
for a transverse text line. We detect it as horizontal scrolling by the temporal tendency analysis
described in Section 4.5, and we thus divide the line into words and track them. All these results
can be viewed at http://documents.cfar.umd.edu/LAMP/Media/Projects/TextTrack/.

The processing time changes considerably with the number of text lines per frame. Tracking
movie credits takes more time than other video types since there are more text lines per frame
in movie credits. For example, for the movie Star Wars, it takes about 1 second to track one
frame (the average number of text lines in a frame is 5), while it takes only 0.17 seconds to track
text in a football game (only one text line is moving in all of the frames). But as we indicated
above, we can detect the text as well as the temporal correspondences of the text blocks.

There are several limitations to our system. First, text tracking is started only when text is
detected. If the text detection module fails, the system will miss the text. Second, our tracker
uses SSD-based image matching to approximate the position and then uses the text contour to
refine the position. Therefore, the system can only be used to track text. In addition, since we
use speed prediction to predict the position of the text, the text’s acceleration is limited. The
tracker has difficulties when text moves too abruptly or keeps moving on a complex background.
This happens especially in sports video. For example, when tracking the name of an athlete on
a jersey, the text may be occluded quickly because of the athlete’s jumping and rotating.

6  Applications  and  Discussion 

In this section we discuss several applications that involve using the detected and tracked
text for video indexing.

6.1  OCR  in  Digital  Video 

For text-based video indexing, OCR is required to convert the text from image format to plain
text. One stumbling block for OCR is that, as mentioned above, the text in digital video
is usually at low resolution. As a result, it is beyond the limits of most commercial OCR
software. Figure 24a shows a text block extracted from a video frame. Even when we manually
pick the best threshold to binarize the image (Figure 24b), there is still no output from the
OCR software¹, even though the text is clearly readable.

Text in digital video is limited in spatial resolution with a font size of approximately 10 × 10
pixels. For typical document images that commercial OCR software works on, scans of 300 dots
per inch are common, translating to characters occupying an area as large as 50 × 50 pixels.
This motivates us to perform text resolution enhancement to achieve reasonable OCR results
using commercial OCR software so that text-based indexing and retrieval is possible.

One way to improve text resolution is to use multiple frames. This type of technique needs
some movement of the frame and the objects contained in it, and usually it requires
estimation of the movement to sub-pixel accuracy. If the text is static or moves in integer pixel
steps, the technique will not yield a significant improvement. We therefore use image interpolation to improve
the resolution directly.

¹ We have tested Caere OmniPage 8.0, Xerox TextBridge Pro98 and Xerox ScanWorx.

Figure 24: (a) A text block with a font size of approximately 6-7 pixels, (b) Binarization
by manually picking the best threshold. There is no output from commercial OCR software
although the text is clearly readable.

Figure 25: Comparison of OCR results. (a) No enhancement, (b) Zero-order interpolation,
(c) Shannon interpolation.

Our experiments have given encouraging results. We chose 45 text blocks extracted from
video frames and used zero-order hold interpolation and Shannon interpolation to increase the
image resolution [34]. The same binarization scheme and OCR software (Xerox TextBridge
Pro98) were used to test the three groups of images (original, zero-order hold, and Shannon).
The OCR accuracy for the original images was 13%. This rises to 34% for zero-order hold
interpolation and to 66.8% for Shannon interpolation (Figure 25). A detailed description of
this topic is beyond the scope of this paper and can be found in [34].
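To make the difference between the two interpolation schemes concrete, the following is a minimal sketch, not the implementation used in our experiments (see [34] for that). It assumes a grayscale text block stored as a 2-D NumPy array and realizes Shannon interpolation in its textbook form, by zero-padding the centered Fourier spectrum, which implicitly treats the block as periodic. The function names and the integer scale factor are hypothetical.

```python
import numpy as np


def zero_order_hold(img, factor):
    """Upsample by replicating every pixel into a factor x factor square."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)


def shannon_interpolate(img, factor):
    """Band-limited (sinc) upsampling via zero-padding of the 2-D spectrum.

    Assumption: the block is treated as periodic, which is what the
    discrete Fourier transform implies.
    """
    h, w = img.shape
    big_h, big_w = h * factor, w * factor

    # Move the zero-frequency term to the center of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(img))

    # Embed the small spectrum in a larger all-zero one, keeping the
    # zero-frequency term at the center (index n // 2) of the new array.
    padded = np.zeros((big_h, big_w), dtype=complex)
    top = big_h // 2 - h // 2
    left = big_w // 2 - w // 2
    padded[top:top + h, left:left + w] = spectrum

    # Invert; rescale because ifft2 divides by the larger pixel count.
    out = np.fft.ifft2(np.fft.ifftshift(padded)).real
    return out * factor ** 2
```

Both functions reproduce the original samples at every `factor`-th pixel. They differ in the values placed in between: zero-order hold yields blocky character edges, while the band-limited reconstruction yields smooth ones, which is consistent with the higher OCR accuracy reported above for Shannon interpolation.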

6.2  Text-based  Indexing  and  Retrieval  in  Digital  Video 

Our ultimate goal is to build a text-based indexing system for digital video. Unlike most
retrieval problems, which deal with clean, full text, video indexing based on automatically
extracted text poses several interesting problems. First, we expect missing or incorrect characters
or even words because of poor OCR performance. As a result, exact matches between words will
often not be possible, and we need to use approximate word matching instead of exact word
matching. For example, if the user submits “house” as a query, the word “hose” in the database
will be considered as a match.
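One possible form of approximate word matching is sketched below. We have not fixed a particular similarity measure; the sketch assumes the Levenshtein edit distance and a hypothetical threshold of one edit, under which “hose” matches the query “house”. The function names are illustrative only.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def approximate_matches(query, indexed_words, max_distance=1):
    """Return indexed words within max_distance edits of the query.

    max_distance is an assumed, tunable threshold.
    """
    q = query.lower()
    return [w for w in indexed_words
            if edit_distance(q, w.lower()) <= max_distance]
```

A fixed threshold trades recall for precision: with one edit allowed, “house” also retrieves words such as “horse” and “mouse”, so in practice the threshold would probably need to depend on word length and on the typical OCR error rate.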

The second problem is that text in digital videos is usually very terse and may lack semantic
breadth. Methods that deal with semantic indexing, such as Latent Semantic Indexing (LSI),
need extensive training data that has similar characteristics. Intuitively, constructing a semantic
dictionary might be a useful approach, but a great deal of work would be required, making this
impractical in general. If we consider only specific topics, such as news, finance, or sports, it
should be possible for us to build a semantic dictionary related to each specific topic. For
example, if a user submits “Financial” as a query, video frames containing “Stock” can be returned
if that association has been entered into the semantic dictionary. We are actively investigating
this topic and will report the research results in the future.
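The idea of a topic-specific semantic dictionary can be illustrated with a small query-expansion sketch. This is only an assumed design, since the work is still in progress. The dictionary contents are hypothetical except for the “Financial”/“Stock” pair from the example above, and the index is assumed to map frame identifiers to the words recognized in each frame.

```python
# Hypothetical topic dictionary: query term -> related index terms.
# Only the "financial" -> "stock" association comes from the text;
# the other entries are made up for illustration.
FINANCE_DICTIONARY = {
    "financial": {"stock", "market", "shares"},
}


def expand_query(query, dictionary):
    """Return the query term together with its related terms."""
    q = query.lower()
    return {q} | dictionary.get(q, set())


def retrieve_frames(query, frame_index, dictionary):
    """Return sorted ids of frames whose recognized words contain
    the query term or any term related to it in the dictionary."""
    terms = expand_query(query, dictionary)
    return sorted(
        frame_id
        for frame_id, words in frame_index.items()
        if terms & {w.lower() for w in words}
    )
```

For instance, with an index such as `{12: ["Stock", "Report"], 40: ["Weather"]}`, the query “Financial” returns frame 12 even though the word itself never appears in that frame.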

7  Conclusions 

We  have  presented  a  system  for  detecting  and  tracking  text  in  digital  video  automatically. 
A hybrid wavelet/neural network based method is used to detect text regions. The tracking
module uses SSD-based image matching to find an initial position, followed by contour-based
stabilization to refine the matched position. The system can detect graphical text and scene
text  with  different  font  sizes  and  can  track  text  that  undergoes  complex  motions. 

We have also discussed the OCR and indexing problems in video images. Our current
focus is on making use of recognized text to build a text-based video indexing and retrieval
system. The experiments we have conducted suggest that text enhancement is necessary for
reasonable OCR results. Further steps are needed to cope with the remaining OCR errors and
to extend the semantic breadth of the extracted text.